A pairwise strategy for imputing predictive features when combining multiple datasets

Abstract

Motivation: In the training of predictive models using high-dimensional genomic data, multiple studies' worth of data are often combined to increase sample size and improve generalizability. A drawback of this approach is that different sets of features may be measured in each study due to variations in expression measurement platform or technology. It is common practice to work only with the intersection of features measured across all studies, which blindly discards potentially useful feature information that is measured in individual studies or in subsets of studies.

Results: We characterize the loss in predictive performance incurred by using only the intersection of feature information available across all studies when training predictors using gene expression data from microarray and sequencing datasets. We study the properties of linear and polynomial regression for imputing discarded features and demonstrate improvements in the external performance of prediction functions through simulation and in gene expression data collected on breast cancer patients. To improve this process, we propose a pairwise strategy that applies any imputation algorithm to two studies at a time and averages imputed features across pairs. We demonstrate that the pairwise strategy is preferable to first merging all datasets together and imputing any resulting missing features. Finally, we provide insights on which subsets of intersected and study-specific features should be used so that missing-feature imputation best promotes cross-study replicability.

Availability and implementation: The code is available at https://github.com/YujieWuu/Pairwise_imputation.

Supplementary information: Supplementary information is available at Bioinformatics online.


Notations     Descriptions
s             study index
X_s           gene profiling dataset for study s
n_s           number of observations in study s
p_s           number of genes in study s
p_sj          number of genes common to study s and study j
X_s,Cs1       subset of X_s containing genes that are common across all studies
X_s,Cs2       subset of X_s containing genes that are specific to study s
G_sj          genes that are common across study s and study j
G_s/sj        genes that are specific to study s, but missing in study j
T1, T2        training sets 1 and 2
V             external validation set
H1            genes that are common to V and the top n genes of both T1 and T2
H2            union of the top n genes in T1 and T2, excluding genes in H1
H             union of the top n genes in T1 and T2
Hc            union of the remaining genes (i.e., not in the top n genes) in T1 and T2
Hint          intersection of all genes across T1, T2 and V

S2.1 'Core' imputation

We illustrate the 'Core' imputation on three studies: two training sets, s = 1, 2, and one validation set, s = 3. We assume that the response variable is not available in the validation set. For each study, suppose there are p_s genes, of which q_s are predictive of the outcome (we refer to these genes as signals) and p_s − q_s are irrelevant to the outcome (we refer to them as noise). Because signals and noise are mixed, a common practice is to retain only the genes most strongly related to the outcome. Suppose, for example, that in each training study we select the 7 genes with the largest absolute coefficient estimates from a LASSO regression of the outcome on the gene expression values. Table S2 provides such an illustrative example, where, for notational convenience, we denote the two training sets as T1 and T2 and the validation set as V.
Let H1 = {A, B, C} denote the intersected genes among the top 7 genes across the two training sets and all of the available genes in the validation set, H2 = {D, E, F, G, H, J, K} denote the union of top 7 genes that are not shared by all studies, and H = H1 ∪ H2.
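As a minimal sketch of the selection step and of the gene sets just defined, the Python snippet below ranks genes by absolute coefficient and then forms H1, H2 and H. The helper name `top_genes`, the coefficient values, and the gene lists assumed for T2 and V are illustrative assumptions chosen to agree with the example; they are not taken from the paper's data or code repository.

```python
def top_genes(coefs, n):
    """Return the n genes with the largest absolute coefficient estimate."""
    ranked = sorted(coefs, key=lambda gene: abs(coefs[gene]), reverse=True)
    return set(ranked[:n])


# Hypothetical LASSO coefficient estimates (made up for illustration only).
coef_t1 = {"A": 2.1, "B": -1.8, "C": 1.5, "D": 1.2, "E": -1.1,
           "F": 0.9, "G": 0.8, "J": 0.1, "K": 0.05}
coef_t2 = {"A": 1.9, "B": 1.7, "C": -1.6, "G": 1.0, "H": 0.9,
           "J": -0.8, "K": 0.7, "L": 0.05}
# Assumed set of genes measured in the validation set V.
genes_v = {"A", "B", "C", "E", "J", "M", "N"}

q1 = top_genes(coef_t1, 7)   # top 7 genes in T1
q2 = top_genes(coef_t2, 7)   # top 7 genes in T2

h1 = q1 & q2 & genes_v       # top genes shared by T1, T2 and present in V
h2 = (q1 | q2) - h1          # top genes not shared by all studies
h = h1 | h2                  # union of the top genes in T1 and T2

print(sorted(h1))  # ['A', 'B', 'C']
print(sorted(h2))  # ['D', 'E', 'F', 'G', 'H', 'J', 'K']
```

Under these assumed inputs the sets reproduce the H1 and H2 of the example, and H coincides with the union of the two top-7 lists.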
The 'Core' imputation method uses the genes in H1 to impute the study-specific missing genes in H2. Three scenarios require consideration, and we introduce them using the specific example in Table S2.

Table S2: Simple example for demonstrating the 'Core' imputation method. The 'Intersected' column contains the intersection of the genes in V and the top 7 genes in T1 and T2; the 'Non-intersected' column applies to T1 and T2 and contains the top 7 genes in T1 and T2 that are not shared by all studies.
Columns: Data set; Top 7 genes for modelling (Intersected, Non-intersected); Remaining genes (not used in 'Core' imputation).

1. Gene D in H2 is found among the top 7 genes of T1, but is missing in V and in the top 7 genes of T2. Therefore, we build an imputation model for gene D using the genes in H1 based on the data from T1, and impute the expression value of gene D in T2 and V.
2. Gene E in H2 is found in V and among the top 7 genes of T1, but is missing in T2. Therefore, the imputation model for gene E is built using the genes in H1 based on data from T1, and the expression value of gene E is imputed for T2. Note that even though gene E is also available in the validation set, we do not use V to build the imputation model, since V is used only for validation.
3. Gene G in H2 is found among the top 7 genes of both T1 and T2, but is missing in V. Therefore, we need to impute gene G for V. To make full use of the data, we merge T1 and T2 and build the imputation model for gene G based on the genes in H1 from these two training sets.

Genes such as M and N are available only in V and are missing from the top 7 genes of T1 and T2. Since V is used for validation, we do not impute genes M and N for T1 and T2. If we have S > 2 training sets, we can repeat the above procedure for all S(S − 1)/2 possible pairs of training sets, each combined with the validation set V, and if a gene is imputed multiple times, we take the average of the imputed values as the final imputation.
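The three 'Core' scenarios reduce to one rule: fit the model on the training sets whose top genes include the gene, and impute wherever the gene is absent. The following Python sketch expresses that reading with plain sets. The function name `core_plan` and the gene lists for T2 and V are assumptions made for illustration, since only part of Table S2 can be read from the text.

```python
def core_plan(q_i, q_j, q_v, name_i="T1", name_j="T2"):
    """For each gene in H2, list the training sets used to fit the
    imputation model and the studies that need imputed values."""
    h1 = q_i & q_j & q_v            # predictors of every imputation model
    h2 = (q_i | q_j) - h1           # study-specific genes to be handled
    training = ((name_i, q_i), (name_j, q_j))
    everything = training + (("V", q_v),)
    plan = {}
    for gene in sorted(h2):
        # V is never used for fitting; it is reserved for validation.
        fit_on = [name for name, genes in training if gene in genes]
        impute_for = [name for name, genes in everything if gene not in genes]
        plan[gene] = (fit_on, impute_for)
    return h1, plan


# Assumed gene lists, chosen to agree with the scenarios described above.
top_t1 = {"A", "B", "C", "D", "E", "F", "G"}
top_t2 = {"A", "B", "C", "G", "H", "J", "K"}
genes_v = {"A", "B", "C", "E", "J", "M", "N"}

h1, plan = core_plan(top_t1, top_t2, genes_v)
for gene in ("D", "E", "G"):
    print(gene, plan[gene])
# D (['T1'], ['T2', 'V'])
# E (['T1'], ['T2'])
# G (['T1', 'T2'], ['V'])
```

Genes M and N never enter the plan because they belong to neither top-7 list, which matches the statement that they are not imputed for the training sets.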

S2.2 'All' imputation
Denote Hc as the union of genes that do not belong to the top 7 genes of the two training studies (i.e. the union of genes in the 'Remaining genes' column of Table S3 for T1 and T2), and Hint as the intersection of all available genes across the three studies. The 'All' imputation method uses the genes in Hint, rather than only the intersection of the top predictive genes (i.e. the genes in H1), to impute the study-specific missing genes in H2. To complete the 'All' imputation, four scenarios need to be considered.
1. Gene E in H2 is found in V and among the top 7 genes of T1, but is missing in T2. Therefore, the imputation model for gene E is built using the genes in Hint based on data from T1, and the expression value of gene E is imputed for T2. Note that even though gene E is also available in the validation set, we do not use V to build the imputation model, since V is used only for validation.
2. Gene J in H2 is found in V and among the top 7 genes of T2, but not among the top 7 genes of T1. However, it is available in Hc for T1. Therefore, we can directly use the expression value of gene J for T1, and no imputation is needed in this case.
3. Gene K in H2 is found among the top 7 genes of T2, but neither in V nor among the top 7 genes of T1. However, it is available in Hc for T1. Therefore, we can directly use the expression value of gene K for T1. Since gene K is completely missing in V, we merge T1 and T2, build the imputation model for gene K based on the genes in Hint, and then impute the missing gene K values in V.
4. Gene D in H2 is found in T1, but is completely missing in T2 and V. Therefore, we build an imputation model for gene D using the genes in Hint based on data from T1, and impute the missing gene D values in T2 and V.

Table S3: Simple example for demonstrating the 'All' imputation method. The 'Intersected' column contains the intersection of the genes in V and the top 7 genes in T1 and T2; the 'Non-intersected' column applies to T1 and T2 and contains the top 7 genes in T1 and T2 that are not shared by all studies; the 'Remaining genes' column contains the other genes in T1 and T2 that are not in the top 7 gene list.

            Top 7 genes for modelling
Data set    Intersected    Non-intersected    Remaining genes
T1          A, B, C        D, E, F, G         J, K, ...
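The four 'All' scenarios can be sketched in the same way as the 'Core' ones, with two changes: a gene counts as available in a training set if it appears anywhere in that set (top genes or remaining genes), and the predictors are the genes in Hint. In the Python sketch below, the function name `all_plan` and every gene list other than the T1 row of Table S3 are assumptions introduced for illustration.

```python
def all_plan(top, rest, genes_v):
    """top / rest: dicts mapping each training set to its top genes and
    its remaining genes. Returns Hint and, for each gene in H2, the
    training sets used for fitting and the studies needing imputation."""
    available = {name: top[name] | rest[name] for name in top}
    available["V"] = set(genes_v)
    h_int = set.intersection(*available.values())      # predictors
    h1 = set.intersection(*top.values()) & set(genes_v)
    h2 = set().union(*top.values()) - h1
    plan = {}
    for gene in sorted(h2):
        # Only training sets are used for fitting; V is validation only.
        fit_on = [name for name in top if gene in available[name]]
        impute_for = [name for name in available
                      if gene not in available[name]]
        plan[gene] = (fit_on, impute_for)
    return h_int, plan


top = {"T1": {"A", "B", "C", "D", "E", "F", "G"},
       "T2": {"A", "B", "C", "G", "H", "J", "K"}}   # T2 list is assumed
rest = {"T1": {"J", "K"}, "T2": {"L"}}               # T2 remainder is assumed
genes_v = {"A", "B", "C", "E", "J", "M", "N"}        # assumed genes in V

h_int, plan = all_plan(top, rest, genes_v)
print(sorted(h_int))          # ['A', 'B', 'C', 'J']
for gene in ("E", "J", "K", "D"):
    print(gene, plan[gene])
# E (['T1'], ['T2'])
# J (['T1', 'T2'], [])
# K (['T1', 'T2'], ['V'])
# D (['T1'], ['T2', 'V'])
```

An empty imputation list, as for gene J here, corresponds to the scenario in which every study keeps its original values.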

S2.3 Algorithm statement
Algorithm 1: 'Core' Imputation
  Data: Ti, Tj (training sets); V (validation set)
  Result: Ti, Tj, V (imputed training and validation sets)
  Initialization: QV: all the genes in V; select the top q predictive genes in each training study: Qi, Qj. Let H1 = Qi ∩ Qj ∩ QV and H2 = (Qi ∪ Qj) \ H1.
  for gene k ∈ H2 do
    if gene k ∉ Qi ∩ Qj and gene k ∉ QV then
      Imputation model: lm(gene k ∼ genes ∈ H1, data = Ti·I(gene k ∈ Qi) + Tj·I(gene k ∈ Qj)), and impute for the other two studies.
    end
    if gene k ∉ Qi ∩ Qj and gene k ∈ QV then
      Imputation model: lm(gene k ∼ genes ∈ H1, data = Ti·I(gene k ∈ Qi) + Tj·I(gene k ∈ Qj)), and impute for the training set in which gene k is missing. The validation set retains the original values of gene k.
    end
    if gene k ∈ Qi ∩ Qj and gene k ∉ QV then
      Imputation model: lm(gene k ∼ genes ∈ H1, data = Ti + Tj), and impute for V.
    end
  end

Algorithm 2: 'All' Imputation
  Data: Ti, Tj (training sets); V (validation set)
  Result: Ti, Tj, V (imputed training and validation sets)
  Initialization: QV: all the genes in V; select the top q predictive genes in each training study: Qi, Qj, and let Qci, Qcj be the other existing genes of each training study, with Hc = Qci ∪ Qcj and Hint the intersection of all genes across Ti, Tj and V.
  for gene k ∈ H2 do
    if gene k ∉ Hc and gene k ∉ QV then
      Imputation model: lm(gene k ∼ genes ∈ Hint, data = Ti·I(gene k ∈ Qi) + Tj·I(gene k ∈ Qj)), and impute for the other two studies.
    end
    if gene k ∉ Hc and gene k ∈ QV then
      Imputation model: lm(gene k ∼ genes ∈ Hint, data = Ti·I(gene k ∈ Qi) + Tj·I(gene k ∈ Qj)), and impute for the other training set. V retains the original values of gene k.
    end
    if gene k ∈ Hc and gene k ∈ QV then
      All studies use the original values of gene k; no imputation is needed.
    end
    if gene k ∈ Hc and gene k ∉ QV then
      Imputation model: lm(gene k ∼ genes ∈ Hint, data = Ti + Tj), and impute for V.
    end
  end
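The imputation model in both algorithms is an ordinary linear regression of the missing gene on the shared genes, fitted where the gene is observed and used to predict where it is not. The Python sketch below illustrates that single step with a small hand-written least-squares solver standing in for R's `lm`. The helper names (`solve`, `fit_lm`, `predict`) and the toy expression values are assumptions for illustration, not the authors' implementation.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with
    partial pivoting (a is assumed square and non-singular)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def fit_lm(predictors, response):
    """Least-squares fit with an intercept via the normal equations."""
    z = [[1.0] + list(row) for row in predictors]
    p = len(z[0])
    ztz = [[sum(r[a] * r[b] for r in z) for b in range(p)] for a in range(p)]
    zty = [sum(r[a] * y for r, y in zip(z, response)) for a in range(p)]
    return solve(ztz, zty)


def predict(coef, predictors):
    """Predicted values: intercept plus the weighted sum of predictors."""
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], row))
            for row in predictors]


# Toy data: two shared genes are observed in both studies, while gene D
# is observed only in T1 (values made up; here D = 1 + 2*A - 3*B exactly).
t1_shared = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
t1_gene_d = [1, 3, -2, 0, 2]
t2_shared = [[2, 2], [3, 0]]

coef = fit_lm(t1_shared, t1_gene_d)          # fit on T1
t2_gene_d_imputed = predict(coef, t2_shared)  # impute gene D for T2
print([round(v, 6) for v in t2_gene_d_imputed])
```

When the data argument is Ti + Tj, the same functions apply after stacking the rows of the two training sets. The normal equations are used here only to keep the sketch dependency-free; a QR-based solver, as used by `lm`, is numerically preferable.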

S3 Code availability
Code for implementing the proposed methods is available at: https://github.com/YujieWuu/Pairwise_imputation

Algorithm 3: Pairwise Imputation
  Data: T1, ..., TS (training sets) and V (validation set)
  Result: T1, ..., TS, V (imputed datasets)
  Initialization: form all S(S − 1)/2 possible pairs of training sets (Ti, Tj), 1 ≤ i < j ≤ S.
  for each pair of training sets do
    Run either the 'Core' or the 'All' imputation algorithm.
  end
  For each study, if a gene was imputed multiple times, take the simple average of the imputed gene expression values as the final imputation.
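The pairing and averaging steps of the pairwise strategy can be sketched as follows. In this Python example, `impute_pair` is a hypothetical callable standing in for either the 'Core' or the 'All' procedure: it is assumed to return, for one pair of training sets, a dictionary mapping (study, gene) to the list of imputed values. The toy numbers are invented purely to show the averaging.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_average(train_names, impute_pair):
    """Run impute_pair on every pair of training sets and average the
    values of any (study, gene) that was imputed more than once."""
    collected = defaultdict(list)
    for name_i, name_j in combinations(train_names, 2):
        for key, values in impute_pair(name_i, name_j).items():
            collected[key].append(values)
    # Element-wise mean across the runs that imputed the same gene.
    return {key: [sum(vals) / len(vals) for vals in zip(*runs)]
            for key, runs in collected.items()}


def toy_impute_pair(name_i, name_j):
    """Hypothetical stand-in for the 'Core' or 'All' procedure."""
    results = {
        ("T1", "T2"): {("V", "D"): [1.0, 2.0]},
        ("T1", "T3"): {("V", "D"): [3.0, 4.0], ("T3", "E"): [0.5]},
        ("T2", "T3"): {},
    }
    return results[(name_i, name_j)]


final = pairwise_average(["T1", "T2", "T3"], toy_impute_pair)
print(final[("V", "D")])  # [2.0, 3.0]
print(final[("T3", "E")])  # [0.5]
```

With S training sets, `combinations` yields exactly S(S − 1)/2 pairs, so the number of imputation runs grows quadratically in S.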

Algorithm 4: Merged Imputation
  Data: T1, ..., TS (training sets) and V (validation set)
  Result: T1, ..., TS, V (imputed datasets)
  Initialization: QV: all the genes in V; select the top q predictive genes in each training study: Q1, ..., QS.

Figure S1: Log RMSE ratio of prediction RMSE on the validation set. (a) Comparing the RMSE between different merged, pairwise imputation methods with omitting method. (b) Comparing the RMSE between pairwise linear and polynomial imputation models with the corresponding merged imputation methods. The cross points represent the difference in the average number of intersected variables that can be used to impute the study-specific missing variables between the merged and pairwise imputation methods.

Figure S3: Simulation results for the scenario when each study has irrelevant genes of the outcome, and the validation set is also incomplete. (a) Comparing the RMSE of prediction on the validation set between 'Core' and 'All' imputation methods with omitting method. (b) Comparing the RMSE from the 'Core' imputation method to the 'All' imputation method. The cross points represent the difference in the average number of irrelevant genes that have non-zero coefficients in the final predicting model between the 'Core' and 'All' imputation methods.

Figure S4: Simulation results for the scenario when X*'s were generated as sine and cosine functions of X's. (a) RMSE of prediction on the validation set for the Omitting, 'Core' and 'All' imputation methods across the 300 simulation replicates. Left panel: β_1 = ... = β_20 = 5, β*_1 = ... = β*_10 = 10; Right panel: β_1 = ... = β_20 = 10, β*_1 = ... = β*_20 = 5. (b, c) Pairwise paired Wilcoxon test on RMSE between Omitting, 'Core' and 'All' imputation methods for scenarios when X*'s have larger and smaller coefficients than X's, respectively.

Figure S5: Simulation results for the scenario when X*'s were generated as sine and cosine functions of X's. (a) Comparing the RMSE of prediction on the validation set between 'Core' and 'All' imputation methods with omitting method. (b) Comparing the RMSE from the 'Core' imputation method to the 'All' imputation method. The cross points represent the difference in the average number of irrelevant genes that have non-zero coefficients in the final predicting model between the 'Core' and 'All' imputation methods.

Figure S7: Comparison of prediction RMSE by imputation across two or three studies at a time. The data generation mechanism follows equation (1) where all covariates in the data set are predictive of the outcome.

Figure S8: Comparison of prediction RMSE by imputation across two or three studies at a time. The data generation mechanism follows equation (2) where additional 10 noises are introduced in the data set.

Figure S12: Include different number of training sets. From left to right, there are 3, 6, and 9 training sets, respectively. The data generation mechanism follows equation (2) where additional 10 noises are introduced in the data set.

Figure S13: Add study heterogeneity in the X − X* relationship. The baseline method for comparison is the Merged imputation approach.

Figure S14: Add study heterogeneity in the X − X* relationship and the noises. The baseline method for comparison is Omitting.

Figure S15: Add study heterogeneity in the X − X* relationship and the noises. The baseline method for comparison is the Merged imputation approach.

Figure S19: Seven studies are treated as training sets and one study as the external validation testing set. Note that as we take the intersection between the top n genes across the 7 training sets and the full gene list in the testing set, when n is small (e.g. 30 or 50), we may end up with an empty intersected gene set.

Figure S20: Run time of the linear pairwise imputation and merged imputation strategies when data is generated following Equation (1).